Red Wine Exploration
by David Vartanian
Abstract
I explore a dataset of 1,599 red wines in order to understand what lies behind the quality score assigned to each one.
Introduction
This dataset was provided by Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis, from different universities in Portugal. It includes chemical measurements such as acidity, residual sugar, chlorides, and alcohol, among others. I explore the data to find patterns and trends and to understand the meaning of the given features.
Univariate Plots Section
Let’s start by showing some summary statistics and a first set of histograms to understand the individual variables.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
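A summary like the one above can be reproduced for any single column with a few lines of code. The following is a minimal Python sketch using only the standard library; the function name `six_number_summary` and the sample values are made up for illustration and are not rows from the dataset.

```python
import statistics

def six_number_summary(values):
    """Return min, quartiles, mean and max, in the same order as the table above."""
    # method="inclusive" treats the data as the full population and
    # interpolates linearly between the sample minimum and maximum.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }

# Made-up alcohol percentages, for illustration only.
sample_alcohol = [9.4, 9.8, 9.8, 9.5, 10.5, 11.2, 12.8, 10.0]
summary = six_number_summary(sample_alcohol)

for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

Comparing the mean with the median in such a summary is a quick way to spot skew: in the table above, for example, the mean of total.sulfur.dioxide (46.47) sits well above its median (38.00), which hints at a long right tail.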
Histograms: quality, fixed.acidity, total.sulfur.dioxide, alcohol

These histograms show how the values are distributed in the different variables.
Outliers

There are a few outliers only on the right side.

There are several outliers only on the right side.

There are many outliers only on the right side.

There are just a few outliers only on the right side.

There are just a few outliers on the left side, and many on the right side.

There are many outliers on the right side.

The pH values are roughly symmetric around 3.31 (the mean and the median nearly coincide). There are several outliers on both sides.

There are many outliers only on the right side.

This variable is also well distributed. There are several outliers on both sides.
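The text does not say which rule was used to decide what counts as an outlier. A common convention, and the one standard box plots use by default, is to flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below assumes that convention; `split_outliers` and the sample values are hypothetical.

```python
import statistics

def split_outliers(values, k=1.5):
    """Return (low, high) outliers using fences at k * IQR from the quartiles.

    The 1.5 * IQR rule is an assumption here, not something the report states.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    low = [v for v in values if v < lower_fence]    # "left side" outliers
    high = [v for v in values if v > upper_fence]   # "right side" outliers
    return low, high

# Made-up residual sugar readings, for illustration only.
sample_sugar = [1.9, 2.0, 2.1, 2.2, 2.2, 2.3, 2.5, 2.6, 6.1, 15.5]
low, high = split_outliers(sample_sugar)

print("left-side outliers: ", low)
print("right-side outliers:", high)
```

Counting the two lists separately gives a numeric version of the "left side" and "right side" statements made for each variable.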
Univariate Analysis
Dataset Structure
There are 9 continuous variables, 2 discrete variables and one ordered categorical variable: quality.
Main dataset interest
My general question is, how do chemical properties define the quality of the red wine?
There are interesting features in this dataset, each of them describing an important property of the red wine. Density, pH, sulphur dioxide, and sulphates are, in my opinion, the most important ones, in order to measure the quality. Let’s see what we can find by looking at those variables.
pH
This variable indicates the acidity level of the wine. The scale goes from 0 (very acidic) to 14 (very basic), but most red wines are between 3 and 4.

It’s quite surprising that pH levels are lower for both low-quality and high-quality wines.
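Because the pH scale is logarithmic, small differences in pH correspond to large differences in acidity. The sketch below uses the textbook definition pH = -log10[H+], treating activity as concentration (a simplification), and compares the two ends of the pH range reported in the summary table. The function name is made up for illustration.

```python
def hydrogen_ion_concentration(ph):
    """Approximate hydrogen ion concentration in mol/L, from pH = -log10[H+]."""
    return 10.0 ** (-ph)

# Minimum and maximum pH taken from the summary table.
ph_min, ph_max = 2.74, 4.01

acidity_ratio = (
    hydrogen_ion_concentration(ph_min) / hydrogen_ion_concentration(ph_max)
)
print(f"The most acidic wine has about {acidity_ratio:.1f}x the [H+] of the least acidic one")
```

So a spread of little more than one pH unit already means more than a tenfold difference in hydrogen ion concentration.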
Density
The density of the wine is close to that of water and depends on its alcohol percentage and sugar content.

Density levels are also lower for both low-quality and high-quality wines.
Free Sulphur Dioxide
The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of the wine.

Again, the levels for this variable are lower for both low-quality and high-quality wines.
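The equilibrium between molecular SO2 and bisulfite depends on pH, which ties this variable to the previous one. A rough sketch using the Henderson-Hasselbalch relation is shown below. The pKa value of 1.81 is an assumed textbook figure (it shifts with temperature and alcohol content), the sulfite ion is ignored because it is negligible at wine pH, and the function name is hypothetical.

```python
def molecular_so2_fraction(ph, pka=1.81):
    """Fraction of free SO2 present in molecular form at a given pH.

    pka=1.81 is an assumed value for the first dissociation of SO2 in water;
    treat the result as an order-of-magnitude estimate, not a measurement.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

# First and third pH quartiles from the summary table.
for ph in (3.21, 3.40):
    print(f"pH {ph:.2f}: {molecular_so2_fraction(ph):.1%} of free SO2 is molecular")
```

Under these assumptions only a few percent of the free SO2 is in molecular form at typical wine pH, and that share falls as pH rises.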
Sulphates
An additive that can contribute to sulphur dioxide gas (SO2) levels and acts as an antimicrobial and antioxidant.

Sulphate levels are lower for low-quality and high-quality wines as well.
Bivariate Plots Section
Let’s try to find trends and interesting patterns by comparing two variables.

Observation: Higher-quality wines seem to have higher levels of alcohol.

Observation: Higher-quality wines seem to have lower levels of acidity.

Observation: Higher-quality wines seem to have lower density.

Citric acid adds a fresh flavor to the wine.

Volatile acidity is the level of acetic acid. At too high a level it gives the wine an unpleasant vinegar taste.

Bivariate Analysis
Relationships

I’ve found a slight negative correlation: density tends to be lower in high-quality wines. However, this correlation is too weak to determine quality on its own.

I’ve found that levels are mostly low for both variables. I would say that they don’t have much influence on quality, as all types of wine show similar levels of these two variables.

I’ve found the same here, as they keep levels constantly low.

Levels are always low. However, these two variables seem to be correlated.
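Statements such as "slightly correlated" or "seem to be correlated" can be backed by a number. The sketch below computes the Pearson correlation coefficient by hand with the standard library; the two short lists are made-up values chosen for illustration, not rows from the dataset, and `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up paired readings, for illustration only.
sample_alcohol = [9.4, 9.8, 10.5, 11.2, 12.8]
sample_density = [0.9978, 0.9968, 0.9962, 0.9955, 0.9940]

r = pearson_r(sample_alcohol, sample_density)
print(f"r = {r:.3f}")
```

Values near 0 support the "little influence" reading, while values near -1 or 1 indicate a strong linear relationship. The sign shows the direction: negative means one variable falls as the other rises.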
Interesting relationships
So far I find only density to be an interesting variable to look at. The rest (chlorides, sulphates, residual sugar, and sulphur dioxide) don’t seem to have much influence on wine quality.
Multivariate Plots Section

A fairly strong correlation can be observed between these two variables with respect to wine quality: it is normal to find lower levels of pH and density in high-quality wines. The line colours show how durable the wine can be with respect to alcohol, using the derived durability variable. It makes sense to me that wines last longer if they contain more alcohol in addition to sulphates and free sulphur dioxide.

I’ve found another interesting correlation here, which becomes quite obvious if we pay attention to the meaning of the variables. Density is the density of the wine itself, which is close to that of water. Since alcohol is less dense than water, the more alcohol a wine contains, the lower its density. The coloured lines show that the durability of the wine is lower when its density is higher. Does it make sense?
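The link between alcohol content and density can be sketched with an idealised two-component mixture of water and ethanol. This is only a rough illustration: it assumes volumes add up exactly (real mixtures contract slightly) and ignores sugar, acids, and other dissolved solids, which raise the density of real wine. The constants are approximate values near 20 °C and the function name is made up.

```python
WATER_DENSITY = 0.998    # g/cm^3, approximate value near 20 °C
ETHANOL_DENSITY = 0.789  # g/cm^3, approximate value near 20 °C

def ideal_mixture_density(alcohol_percent_by_volume):
    """Density of a water/ethanol mix, assuming ideal (additive) volumes."""
    fraction = alcohol_percent_by_volume / 100.0
    return fraction * ETHANOL_DENSITY + (1.0 - fraction) * WATER_DENSITY

# Minimum and maximum alcohol levels from the summary table.
for alcohol in (8.4, 14.9):
    print(f"{alcohol:>5.1f}% alcohol -> {ideal_mixture_density(alcohol):.4f} g/cm^3")
```

The absolute numbers come out lower than the densities in the dataset, as expected given what the model leaves out, but the direction matches the plot: more alcohol means lower density.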
Multivariate Analysis
## quality mean_quality mean_alcohol mean_density
## Min. :3.00 Min. :3.00 Min. : 9.90 Min. :0.9952
## 1st Qu.:4.25 1st Qu.:4.25 1st Qu.:10.03 1st Qu.:0.9962
## Median :5.50 Median :5.50 Median :10.45 Median :0.9966
## Mean :5.50 Mean :5.50 Mean :10.72 Mean :0.9965
## 3rd Qu.:6.75 3rd Qu.:6.75 3rd Qu.:11.26 3rd Qu.:0.9970
## Max. :8.00 Max. :8.00 Max. :12.09 Max. :0.9975
## mean_ph mean_citric_acid n
## Min. :3.267 Min. :0.1710 Min. : 10.00
## 1st Qu.:3.294 1st Qu.:0.1915 1st Qu.: 26.75
## Median :3.312 Median :0.2588 Median :126.00
## Mean :3.327 Mean :0.2715 Mean :266.50
## 3rd Qu.:3.366 3rd Qu.:0.3498 3rd Qu.:528.25
## Max. :3.398 Max. :0.3911 Max. :681.00
Density & pH
Durability
Final Plots and Summary
Durability & Alcohol

This plot shows something remarkable: high-quality wines seem to last longer. The orange line in the top-right corner makes a huge difference, as wines last much longer when the alcohol level is higher.
Citric Acid vs. pH

This is a pretty straightforward correlation. When the pH level gets lower (which means the wine is more acidic), the citric acid level gets higher. It makes sense, doesn’t it?
Density by Quality Level

I wanted to emphasize this plot again because density levels look similar for both low-quality and high-quality wines. From another perspective, density is higher only in mid-quality wines.
Reflection
I feel that I now have a few extra tips for selecting new wines to taste: look for higher levels of alcohol and acidity, lower density, and low levels of residual sugar, chlorides, and sulphates. The high alcohol levels and low density were definitely surprising to me. However, I think the dataset needs more categorical variables and much more data to support a better analysis.
For instance, adding columns for typical customers, sommeliers’ preferences, country of origin, grape variety, altitude of the vineyards, and the type of cask used to store the wine before sale would be of great value. They would make it possible to measure wine quality not only through the product itself but also through its background environment and production process.